Error Bounds of Imitating Policies and Environments
Imitation learning trains a policy by mimicking expert demonstrations. Various imitation methods have been proposed and empirically evaluated; however, their theoretical understanding requires further study. In this paper, we first analyze the value gap between the expert policy and policies imitated by two methods, behavioral cloning and generative adversarial imitation. The results show that generative adversarial imitation can reduce compounding errors compared to behavioral cloning, and thus enjoys better sample complexity. Noting that imitation learning can also be used to learn an environment model by treating the environment transition model as a dual agent, we further analyze the performance of imitating environments based on the bounds for imitating policies. The results show that environment models can be imitated more effectively by generative adversarial imitation than by behavioral cloning, suggesting a novel application of adversarial imitation to model-based reinforcement learning. We hope these results inspire future advances in imitation learning and model-based reinforcement learning.
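The contrast between the two methods can be sketched in the infinite-horizon discounted setting. The symbols below ($R_{\max}$ for the reward bound, $\gamma$ for the discount factor, $\epsilon$ for the imitation error) are illustrative; the precise statements, assumptions, and constants are those of the theorems in the paper:

```latex
% Behavioral cloning: a per-state policy error \epsilon compounds along the
% trajectory, so the value gap scales quadratically in the effective
% horizon 1/(1-\gamma):
\bigl| V^{\pi_E} - V^{\pi_{\mathrm{BC}}} \bigr|
  \;\lesssim\; \frac{R_{\max}\,\epsilon}{(1-\gamma)^{2}}.

% Generative adversarial imitation matches state-action occupancy measures
% directly, so an occupancy-measure error \epsilon enters only linearly:
\bigl| V^{\pi_E} - V^{\pi_{\mathrm{GA}}} \bigr|
  \;\lesssim\; \frac{R_{\max}\,\epsilon}{1-\gamma}.
```

The quadratic versus linear dependence on the effective horizon is the formal sense in which adversarial imitation "reduces compounding errors."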
Review for NeurIPS paper: Error Bounds of Imitating Policies and Environments
Weaknesses: The comparison with BC seems out of date, as many more recent approaches address the compounding-error problem and have been shown to achieve better results in imitation learning, such as MaxEnt IRL and other recent IRL methods. Reactive policy matching can lead to bad states, which is why value iteration and Q-learning are used to evaluate state-action pairs in terms of future cumulative reward. For example, MaxEnt IRL uses value iteration to capture the feature preferences of experts and model their behavior in unseen settings, with the explicit goal of matching the distribution over near-optimal state-action pairs rather than matching the expert policy directly. The learned policies may differ from the expert demonstrations as long as the state-action features optimized by the learner are similar to those of the demonstrator. In other words, there are multiple near-optimal policies that are acceptable and still indicative of the demonstrated behavior; learning such policies reduces the effect of compounding errors and allows deviations from the expert state distribution.
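The planning step the review alludes to can be made concrete. Below is a minimal sketch (not the paper's method, and all names are hypothetical) of the soft value iteration that MaxEnt IRL uses to turn a reward estimate into a stochastic policy spread over near-optimal actions, shown on a tiny three-state chain:

```python
import numpy as np

def soft_value_iteration(P, r, gamma=0.9, iters=200):
    """Soft value iteration, as used inside MaxEnt IRL (illustrative sketch).

    P: (A, S, S) transition tensor, P[a, s, s'] = Pr(s' | s, a).
    r: (S,) reward vector (in MaxEnt IRL this would be the current
       reward estimate, e.g. a linear function of state features).
    Returns soft values V (S,) and a stochastic policy pi (S, A).
    """
    A, S, _ = P.shape
    V = np.zeros(S)
    for _ in range(iters):
        # Q(a, s) = r(s) + gamma * E_{s'}[V(s')]
        Q = r[None, :] + gamma * (P @ V)          # shape (A, S)
        # log-sum-exp over actions instead of a hard max: this is what
        # spreads probability over all near-optimal actions
        V = np.log(np.exp(Q).sum(axis=0))
    Q = r[None, :] + gamma * (P @ V)
    pi = np.exp(Q - V[None, :]).T                 # (S, A); rows sum to 1
    return V, pi

# Tiny deterministic 3-state chain: action 0 moves left, action 1 moves
# right; only the rightmost state carries reward.
P = np.zeros((2, 3, 3))
P[0] = [[1, 0, 0], [1, 0, 0], [0, 1, 0]]
P[1] = [[0, 1, 0], [0, 0, 1], [0, 0, 1]]
r = np.array([0.0, 0.0, 1.0])
V, pi = soft_value_iteration(P, r)
# The resulting policy prefers moving right (toward reward) in every
# state, but keeps nonzero probability on the suboptimal action.
```

The key design point, matching the review's argument: because the policy is a softmax over Q-values rather than an argmax, any policy whose induced feature counts match the expert's is representable, which is what tolerates deviations from the expert state distribution.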